The analysis is based on an unknown sample obtained from either mice or human tissues. After tissue dissociation, cells were sorted by FACS and scRNA-seq libraries were prepared with the Smart-Seq2 protocol/platform.
library(tidyverse)
library(Seurat)
library(patchwork)
library(dplyr)
library(SingleR)
library(celldex)
library(SummarizedExperiment)
library(HGNChelper) #contains functions for identifying and correcting HGNC human gene symbols and MGI mouse gene symbols
library(openxlsx)
The starting point is a digital count matrix with mouse or human
genes as features. The analysis will focus on the XRA.rds
file.
load("./data/XRA.rds") # count matrix
A Seurat object is created to store both the count matrix and the following analysis steps (e.g. PCA).
XRA <- CreateSeuratObject(counts = counts, # give in input the count matrix
project = "XRA", # name of the project
min.cells = 3, # filter for genes (rows)
# keep cells that have genes expressed in at least 3 cells
min.features = 50 # filter for cells (columns)
# keep cells that have at least 50 features
)
XRA
## An object of class Seurat
## 42740 features across 5043 samples within 1 assay
## Active assay: RNA (42740 features, 0 variable features)
## 1 layer present: counts
The analysis is performed using Seurat v. 5.1.0. This version allows
to store data in layers. In layer="counts" are stored the
raw and un-normalized counts. With the following command, it is possible
to check the data and try to access the count matrix.
LayerData(XRA, assay="RNA", layer="counts")[500:505, 1:30]
## 6 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'GJOXFX', 'YRQBFE', 'RJUMSZ' ... ]]
##
## CELA3A . . . . . . . . . . . . . . . . . . . . . . .
## LINC01635 . . . . . . . . . . . . . . . . . . . . . . .
## LINC00339 . . . . . . . . . . . . . . . . . . . . . . .
## CDC42 . 1800 42 . 727 . . 11 1 410 . 609 . 1239 902 . 767 9 . . 8 219 2
## AL031281.2 . . . . . . . . . . . . . . . . . . . . . . .
## CDC42-IT1 . . . . . . . . . . . . . . . . . . . . . . .
##
## CELA3A . . . . . . .
## LINC01635 . . . . . . .
## LINC00339 . . . . . . 303
## CDC42 1792 . 234 . 940 1692 1816
## AL031281.2 . . . . . . .
## CDC42-IT1 . . . . . . .
The name of the genes are reported in uppercase, so it is possible to assume that the unknown sample is obtained from human.
To obtain the cleaned matrix of data, the standard workflow of quality control and filtering was used.
The filtering based on the percentage of mitochondrial genes aims to filter out genes with increased number of genes that map to a mitochondrial genome. This is necessary because low-quality or dying cells often exhibit extensive mitochondrial contamination.
XRA[["percent_mt"]] <- PercentageFeatureSet(XRA, pattern = "^MT-")
max(XRA[["percent_mt"]])
## [1] 93.63296
As additional quality control, the filtering based on spike-in RNAs starting with “ERCC” is performed. (ERCC stands for External RNA Controls Consortium.)
XRA[["percent_ERCC"]] <- PercentageFeatureSet(XRA, pattern = "^ERCC")
max(XRA[["percent_ERCC"]])
## [1] 0.772809
It is possible to visualize the quality control metrics as a violin plot.
p1<- VlnPlot(XRA, # object in input
features = c("nFeature_RNA", "nCount_RNA", "percent_mt", "percent_ERCC"),
# features from metadata that we want to show
ncol = 4, # number of columns to show in the plot
pt.size = 0.01) # dot size
p1
By looking at the violin plot showing the number of features, it is possible to see some outliers cells with high levels of number of features. This cells might be doublets or multiplets. There are also cells with a very low number of features. These might be dying cells.
By looking at the percentage of mitochondria, we see some cells that are outliers (with high levels of mitochondrial genes). So, in this case, we can discard these cells because they are dying cells, low quality cells or empty droplets.
To inspect the relationship between number of counts and percentage
of mitochondrial genes and number of features, the function
FeatureScatter() was used.
plot1 <- FeatureScatter(XRA, feature1 = "nCount_RNA", feature2 = "percent_mt")
# specify the two features we want to correlate
plot2 <- FeatureScatter(XRA, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
Based on the Violin plots, the filtering will be performed by setting the following thresholds:
XRA <- subset(XRA,
subset = nFeature_RNA < 3000 &
percent_mt < 15 &
percent_ERCC < 0.1
)
After the filtering of the dataset, the workflow proceed with the normalization.
By default, Seurat employs a global-scaling normalization method
LogNormalize that normalizes the feature expression
measurements for each cell by the total expression, multiplies this by a
scale factor (10,000 by default), and log-transforms the result (to
reduce skewness). Normalized values are stored in the data layer
(XRA[["RNA"]]$data).
XRA <- NormalizeData(XRA,
normalization.method = "LogNormalize",
scale.factor = 10000)
## Normalizing layer: counts
XRA[["RNA"]]$data[1:10,1:30] # inspect normalized counts
## 10 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'RJUMSZ', 'JUYFQD', 'UONUAM' ... ]]
##
## DDX11L1 . . . . . . . . . . . . . . . . . . . . . . . . .
## WASH7P . . . . 0.03539945 . . . . . . . . 0.00563619 . . . . . . . . . . .
## MIR6859-1 . . . . . . . . . . . . . . . . . . . . . . . . .
## MIR1302-2HG . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.1 . . . . . . . . . . . . . 0.01681416 . . . . . . . . . . .
## CICP27 . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.6 . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.7 . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.5 . . . . . . . . . . . . . . . . . . . . . . . . .
##
## DDX11L1 . . . . .
## WASH7P . . . . .
## MIR6859-1 . . . . .
## MIR1302-2HG . . . . .
## OR4F5 . . . . .
## AL627309.1 . . . . .
## CICP27 . . . . .
## AL627309.6 . . . . .
## AL627309.7 . . . . .
## AL627309.5 0.01199446 . . . .
When this normalization method is applied, the assumption is that each cell should have the same number of reads, so anytime a different number of reads is observed, that difference is technical and not biological, and it needs to be corrected. However, this assumption is questionable, especially in single-cell analysis, because it is known that cells donʼt have the same number of transcripts (it depends on the cell type).
After the log normalization, the workflow continues with the selection of a subset of features (genes) that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
The procedure to select variable features is implemented in the
FindVariableFeatures function (the procedure models the
mean-variance relationship inherent in single-cell data). By default,
the function returns the 2,000 most variable features per dataset. These
will be used in downstream analysis, like PCA.
XRA <- FindVariableFeatures(XRA,
selection.method = "vst",
nfeatures = 2000)
## Finding variable features for layer counts
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(XRA), 10)
The top 10 variable features are:
top10
## [1] "ACTA1" "CKM" "TPSB2" "TPSAB1" "CXCL8" "TNNC2" "PLA2G2A"
## [8] "C7" "MYL2" "ENO3"
# plot variable features with labels
plot1 <- VariableFeaturePlot(XRA)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
In the plot reported above, each red dot is a highly variable feature.
By scaling, Seurat applies a linear transformation to the expression levels of each gene, that is a standard pre-processing step prior to dimensional reduction techniques like PCA.
The ScaleData function:
Shifts the expression of each gene, so that the mean expression across cells is 0
Scales the expression of each gene, so that the variance across cells is 1 (z-score transformation)
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
The results of this are stored in
XRA[["RNA"]]$scale.data.
all_genes <- rownames(XRA) # perform scaling on all genes (by default, only the top 2000 are scaled)
XRA <- ScaleData(XRA,
features = all_genes) # specify which features to scale
## Centering and scaling data matrix
XRA[["RNA"]]$scale.data[1:5,1:10]
## RJUMSZ JUYFQD UONUAM TSMWLG TFSDGD
## DDX11L1 -0.03302852 -0.03302852 -0.03302852 -0.03302852 -0.03302852
## WASH7P -0.07904836 -0.07904836 -0.07904836 -0.07904836 0.46208668
## MIR6859-1 -0.03125252 -0.03125252 -0.03125252 -0.03125252 -0.03125252
## MIR1302-2HG -0.01635612 -0.01635612 -0.01635612 -0.01635612 -0.01635612
## OR4F5 -0.01748152 -0.01748152 -0.01748152 -0.01748152 -0.01748152
## GBQWUU LKVPRJ HZQFDM YPZTJL ZMMETL
## DDX11L1 -0.03302852 -0.03302852 -0.03302852 -0.03302852 -0.03302852
## WASH7P -0.07904836 -0.07904836 -0.07904836 -0.07904836 -0.07904836
## MIR6859-1 -0.03125252 -0.03125252 -0.03125252 -0.03125252 -0.03125252
## MIR1302-2HG -0.01635612 -0.01635612 -0.01635612 -0.01635612 -0.01635612
## OR4F5 -0.01748152 -0.01748152 -0.01748152 -0.01748152 -0.01748152
An alternative normalization method could be
sctransform. This method was introduced in Hafemeister
and Satija, Genome Biology 2019 and with its procedure omits the
need for heuristic steps including pseudocount addition or
log-transformation. This results in an improvement of the common
downstream analytical tasks such as variable gene selection, dimensional
reduction, and differential expression.
SCTransform is a single command that replaces
NormalizeData(), ScaleData(), and
FindVariableFeatures(). Transformed data are available in
the SCT assay in the SEURAT object, which is set as the default after
running sctransform.
In the following chunks is reported the alternative workflow. However, the SCTransform method will not be used.
XRA <- SCTransform(XRA, verbose = FALSE)
XRA[["SCT"]]@scale.data[1:5,1:10]
The presence of cell cycle within scRNAseq data may introduce variability. If the experiment does not focus on the cell cycle, strategies can be employed to address this heterogeneity.
In Seurat, handling cell cycle effects involves computing phase scores using established markers and then removing them during data pre-processing to alleviate the impact of cell cycle heterogeneity on scRNA-seq data.
A list of cell cycle markers for the S and G2/M phases, derived from Tirosh et al, 2015, is loaded with Seurat. This list can be segregated into markers of G2/M phase and markers of S phase. These two list of markers (denoting G2/M and S cell phase) will be used to find matches against the cell from XRA dataset.
s_genes <- cc.genes$s.genes # replication phase
g2m_genes <- cc.genes$g2m.genes # division phase
First, each cell is assigned a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase.
Scores are assigned in the CellCycleScoring function,
which stores S and G2/M scores in object meta data, along with the
predicted classification of each cell in either G2M, S or G1 phase. The
results are stored in the meta data.
CellCycleScoring can also set the identity of the Seurat
object to the cell-cycle phase by passing set.ident = TRUE
(the original identities are stored as old.ident). Please
note that Seurat does not use the discrete classifications (G2M/G1/S) in
downstream cell cycle regression. Instead, it uses the quantitative
scores for G2M and S phase. However, the predicted classifications are
provided in case they are of interest.
cell_cycle <- CellCycleScoring(XRA,
s.features = s_genes, # markers S phase
g2m.features = g2m_genes, # markers G2M phase
set.ident = TRUE) # change identity from cluster to cell cycle
# view cell cycle scores and phase assignments
head(cell_cycle[[]])
## orig.ident nCount_RNA nFeature_RNA percent_mt percent_ERCC S.Score
## RJUMSZ XRA 1332675 1482 4.092296 0.0000000000 -0.05881237
## JUYFQD XRA 169167 780 7.389739 0.0000000000 -0.03958919
## UONUAM XRA 813984 1549 6.513027 0.0000000000 -0.05306389
## TSMWLG XRA 1612853 1669 11.803370 0.0060761892 -0.01453037
## TFSDGD XRA 1110079 1785 5.572126 0.0000000000 -0.06242548
## GBQWUU XRA 915870 1476 4.183454 0.0002183716 -0.02143424
## G2M.Score Phase old.ident
## RJUMSZ -0.01538705 G1 XRA
## JUYFQD 0.01164100 G2M XRA
## UONUAM 0.03736741 G2M XRA
## TSMWLG -0.06770569 G1 XRA
## TFSDGD -0.02714501 G1 XRA
## GBQWUU -0.06982177 G1 XRA
# convert Seurat Object to Data frame
df_XRA <- as.data.frame(cell_cycle[[]], genes = Seurat::VariableFeatures(cell_cycle[[]]), fix_names = TRUE)
RidgePlot(cell_cycle,
features = c("NASP", "SLBP", "TUBB4B", "ANP32E"),
ncol = 2)
## Picking joint bandwidth of 0.101
## Picking joint bandwidth of 0.0807
## Picking joint bandwidth of 0.149
## Picking joint bandwidth of 0.0661
The Ridge Plot above shows the expression levels of 2 marker genes of S phase (NASP and SLBP) and 2 marker genes of G2M phase.
Then, the PCA on the cell cycle genes is performed.
cell_cycle <- RunPCA(cell_cycle, features = c(s_genes, g2m_genes))
## Warning in PrepDR5(object = object, features = features, layer = layer, : The
## following features were not available: MLF1IP, FAM64A, HN1.
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## PC_ 1
## Positive: ANP32E, CKS2, TUBB4B, NASP, SLBP, HMGB2, RPA2, PCNA, SMC4, RRM1
## CBX5, CTCF, MCM6, TMPO, GMNN, LBR, USP1, CASP8AP2, CKS1B, UNG
## RANGAP1, MCM5, TACC3, RFC2, NUSAP1, CKAP2, UBR7, NCAPD2, HELLS, MCM4
## Negative: TIPIN, CCNB2, UHRF1, TTK, RAD51, CDCA7, E2F8, ANLN, CDC25C, KIF23
## BRIP1, HJURP, KIF2C, CENPA, AURKB, POLA1, CLSPN, DTL, MSH2, EXO1
## GTSE1, GINS2, NEK2, NUF2, CDC45, CDCA2, DLGAP5, CDC20, PRIM1, PSRC1
## PC_ 2
## Positive: HMGB2, SLBP, LBR, MCM6, CTCF, USP1, CKAP2, KIF20B, SMC4, NUSAP1
## ANP32E, TACC3, NDC80, ATAD2, PCNA, HELLS, TMPO, CKAP5, RANGAP1, CCNE2
## CENPE, WDR76, DLGAP5, GINS2, MKI67, CLSPN, TOP2A, HMMR, DTL, POLD3
## Negative: TUBB4B, CKS2, MCM5, RFC2, RRM1, UNG, RPA2, GAS2L3, CBX5, CKS1B
## UBR7, ECT2, G2E3, GMNN, TIPIN, NASP, CDC6, MSH2, POLA1, CHAF1B
## BIRC5, TYMS, MCM2, UBE2C, CDK1, CDC20, PSRC1, CENPA, KIF2C, CENPF
## PC_ 3
## Positive: SLBP, CKS1B, RANGAP1, CKS2, HMGB2, CBX5, GMNN, NCAPD2, MCM5, RPA2
## MCM6, GAS2L3, TACC3, NUSAP1, CTCF, MCM4, CENPF, CHAF1B, CDCA3, MKI67
## BIRC5, HMMR, TPX2, CDCA2, BLM, CKAP2L, TOP2A, UBE2C, BUB1, ANP32E
## Negative: TMPO, LBR, USP1, UBR7, CKAP5, RRM1, SMC4, NASP, POLD3, ECT2
## UNG, MSH2, CKAP2, TUBB4B, PRIM1, RFC2, POLA1, KIF20B, G2E3, CASP8AP2
## TYMS, TIPIN, GINS2, DLGAP5, HELLS, WDR76, PCNA, CDC45, PSRC1, CCNE2
## PC_ 4
## Positive: MCM5, RANGAP1, CTCF, POLD3, GAS2L3, LBR, TMPO, SMC4, CKS2, NCAPD2
## MCM6, CKAP5, TUBB4B, CBX5, CDK1, TYMS, TACC3, FEN1, CENPF, ECT2
## PSRC1, CKAP2, CENPE, BLM, RPA2, HELLS, BIRC5, TPX2, CKAP2L, NUSAP1
## Negative: GMNN, CKS1B, RRM1, UNG, PRIM1, RFC2, G2E3, UBR7, USP1, CASP8AP2
## SLBP, PCNA, ANP32E, TIPIN, POLA1, NDC80, ATAD2, MSH2, NASP, KIF20B
## HMGB2, MCM4, GTSE1, CLSPN, CDC20, HJURP, CDCA7, GINS2, CDC45, NUF2
## PC_ 5
## Positive: CKS1B, PCNA, CTCF, CBX5, POLD3, G2E3, POLA1, SLBP, SMC4, RFC2
## UBR7, RPA2, CKAP5, NCAPD2, HELLS, MSH2, MCM5, CHAF1B, ATAD2, TMPO
## CENPE, DLGAP5, BUB1, FEN1, PSRC1, CDCA3, TOP2A, GAS2L3, UBE2C, TPX2
## Negative: CASP8AP2, HMGB2, UNG, RANGAP1, CKS2, TUBB4B, NASP, ANP32E, MCM6, BLM
## AURKA, GMNN, KIF20B, LBR, CDC45, CDK1, NDC80, TIPIN, RRM1, CCNE2
## TACC3, CLSPN, GINS2, HJURP, AURKB, GTSE1, USP1, NEK2, CDC6, CENPF
DimPlot(cell_cycle)
By looking at the PCA, it is possible to observe that cells in G1 are clustered together, while there is a overlap between cells in G2M phase and cells in S phase.
The analysis aims to subtract (‘regress out’) this source of heterogeneity from the data. For each gene, Seurat models the relationship between gene expression and the S and G2M cell cycle scores. The scaled residuals of this model represent a ‘corrected’ expression matrix, that can be used downstream for dimensional reduction.
After this regression, a PCA on the variable genes no longer returns components associated with cell cycle. ALERT (long time step)
cell_cycle <- ScaleData(cell_cycle,
vars.to.regress = c("S.Score", "G2M.Score"),
features = rownames(cell_cycle))
## Regressing out S.Score, G2M.Score
## Centering and scaling data matrix
cell_cycle <- RunPCA(cell_cycle,
features = ,
nfeatures.print = 10)
## PC_ 1
## Positive: CD74, RNASE1, PECAM1, VWF, AQP1, THBD, ACKR1, TM4SF1, FABP5, ADAMTS9
## Negative: DCN, C1S, APOD, C3, ADH1B, COL6A3, CFD, GPX3, MFAP5, CILP
## PC_ 2
## Positive: CKM, ACTA1, ENO3, DES, MYL1, ANKRD2, SLN, MYBPC1, MYOZ1, MB
## Negative: CD74, TM4SF1, VWF, PECAM1, A2M, AQP1, ADAMTS9, THBD, ACKR1, IL6
## PC_ 3
## Positive: A2M, TM4SF1, ADAMTS1, VWF, ADAMTS9, ADAMTS4, AQP1, IL6, PECAM1, CCL2
## Negative: FCER1G, LYZ, IL1B, BCL2A1, CD163, S100A9, GPR183, CXCR4, MS4A7, C1QA
## PC_ 4
## Positive: NOTCH3, ACTA2, RGS5, GJA4, HIGD1B, TAGLN, GJC1, CDH6, LMOD1, SSTR2
## Negative: ACKR1, SELE, PLVAP, CD74, PLAT, RAMP3, CSF3, ICAM1, PECAM1, VWF
## PC_ 5
## Positive: CXCR4, CD69, GZMB, GNLY, IL7R, CTSW, PRF1, NKG7, MYF5, GZMH
## Negative: CD163, LYZ, IL1B, CXCL8, FCER1G, C1QA, F13A1, PLAUR, MS4A7, BCL2A1
DimPlot(cell_cycle)
By looking at the PCA, it is possible to observe that cells don’t separate by cell-cycle phase.
XRA <- cell_cycle
rm(s_genes,g2m_genes,cell_cycle)
Dimensionality reduction techniques are employed to reduce data complexity in downstream analyses (e.g. clustering) and for data visualisation.
Principal component analysis (PCA) is the most popular dimension reduction method. For each cell, the original dimensions correspond to the expression values of each gene.
PCA performs an orthogonal transformation of the original dataset to create a set of new, uncorrelated variables or principal components. These principal components are linear combinations of variables in the original dataset. The transformation is defined such that the principal components are ranked with decreasing order of variance. Thus the first principal component amounts to the largest possible variance.
The idea is to discard the principal components with the lowest variance, and effectively reduce the dimensions of the dataset without much loss of information.
By default, in Seurat only the previously determined variable features are used as input for PCA, but this can be defined using features argument if you wish to choose a different subset.
XRA <- RunPCA(XRA,
features = VariableFeatures(object = XRA), # predetermined variable features
verbose = F)
PCA is highly interpretable and computationally efficient, but it is inappropriate for scRNA-seq data visualization.
scRNA-seq data is sparse due to dropout events (weakly expressed genes are missed), meaning there are 60–80% zeroes in the data matrix. It is a highly non-linear structure, while PCA is a linear dimensional reduction technique and therefore deemed very inappropriate for data visualisation. PCA is only used to select ~top 10–50 principal components that can be processed with downstream applications like cluster analysis.
Other dimensionality reduction techniques are used for visualization (t-SNE, UMAP).
Seurat provides several ways of visualizing both cells and features
that define the PCA, including VizDimReduction,
DimPlot, and DimHeatmap. PCA results are
stored in XRA[["pca"]].
VizDimLoadings(XRA, dims = 1:2, reduction = "pca")
DimPlot(XRA, reduction = "pca")
DimHeatmap allows for easy exploration of the primary
sources of heterogeneity in a dataset, and can be useful when trying to
decide which PCs to include for further downstream analyses. Both cells
and features are ordered according to their PCA scores. Setting “cells”
to a number plots the ‘extreme’ cells on both ends of the spectrum,
which dramatically speeds plotting for large datasets.
DimHeatmap(XRA,
dims = 1:2, #number of dimensions
cells = 500,
balanced = TRUE)
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, it is needed to decide how many components should be included.
So, a heuristic method is used to decide the number of PC to
consider. This method generates an ‘Elbow plot’: a ranking of principle
components based on the percentage of variance explained by each one
(ElbowPlot function).
ElbowPlot(XRA)
In this case, the elbow can be observed around PC 6. This number will be used as input for the clustering step.
Seurat applies a graph-based clustering approach. The distance metric
which drives the clustering analysis is based on previously identified
PCs. The approach to partitioning the cellular distance matrix into
clusters is the following: cells are embedded in a graph structure - for
example a K-nearest neighbor (KNN) graph, with edges drawn between cells
with similar feature expression patterns, and then there is an attempt
to partition this graph into highly interconnected ‘quasi-cliques’ or
‘communities’. Seurat first constructs a KNN graph based on the
euclidean distance in PCA space, and refine the edge weights between any
two cells based on the shared overlap in their local neighborhoods
(Jaccard similarity). This step is performed using the
FindNeighbors function, and takes as input the previously
defined dimensionality of the dataset.
Then, to cluster the cells, modularity optimization techniques (such
as the Louvain algorithm (default)) are applied to iteratively group
cells together, with the goal of optimizing the standard modularity
function. The FindClusters function implements this
procedure, and contains a resolution parameter that sets the
‘granularity’ of the downstream clustering, with increased values
leading to a greater number of clusters.
The clusters can be found using the Idents function.
XRA <- FindNeighbors(XRA, reduction = "pca", dims = 1:6)
## Computing nearest neighbor graph
## Computing SNN
XRA <- FindClusters(XRA, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3738
## Number of edges: 113358
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9100
## Number of communities: 9
## Elapsed time: 0 seconds
head(Idents(XRA), 5)
## RJUMSZ JUYFQD UONUAM TSMWLG TFSDGD
## 3 4 3 3 3
## Levels: 0 1 2 3 4 5 6 7 8
DimPlot(XRA, reduction = "pca", label = T)
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. The same PCs as input to the clustering analysis are used as input to the UMAP and tSNE.
t-SNE is a graph based, non-linear dimensionality reduction technique. It projects high dimensional data onto 2D or 3D components. It works by calculating pairwise similarities between data points, emphasizing close similarities and using Gaussian distributions to define relationships. Then, it maps the data to a lower-dimensional space (typically two dimensions) while preserving these pairwise similarities. t-SNE employs an iterative optimization process to position data points in the lower-dimensional map, ensuring that similar data points remain close, while dissimilar ones are separated. This makes it a valuable tool for exploring and visualizing intricate data structures and identifying clusters or patterns within the data.
Pros
t-SNE powerfully captures the non-linearity in high dimensional datasets and is able to retain the local structures in low dimensions. This is a huge improvement over PCA. t-SNE has been used as a gold standard method for scRNA-seq data visualisation.
Cons
The way t-SNE works, it is impossible for it to preserve the global structure while performing dimension reduction. Only local structures are preserved, while the distances between groups are drastically different depending on the run.
XRA <- RunTSNE(XRA, dims = 1:6)
DimPlot(XRA, reduction = "tsne",label=T)
UMAP is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. UMAP is a relatively new dimensional reduction technique introduced by McInnes et al in 2018. (McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018)
The algorithm is graph based and principally similar to t-SNE where it constructs a high dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible.
Pros
Non linear datasets: UMAP is manifold learning dimension reduction technique and thus captures the non linearity of real world datasets. It is comparable to t-SNE in terms of data visualisation.
The mathematical improvements in UMAP allow superior run time performance over t-SNE
In comparison to t-SNE, UMAP offers better preservation of a data’s global structure.
Unlike t-SNE, UMAP has no computational restrictions on embedding dimensions and can be used as an effective pre-processing step to boost the performance of density based clustering algorithms.
Cons
Lacks interpretability: Unlike PCA, where the principal components are directions of greatest variance of the source data, the lower dimension embeddings of UMAP lack strong interpretability.
One of the core assumptions of UMAP is that there exists manifold structure in the data. Because of this, UMAP can tend to find manifold structure within the noise of a dataset.
XRA <- RunUMAP(XRA, dims = 1:6)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 23:32:43 UMAP embedding parameters a = 0.9922 b = 1.112
## 23:32:43 Read 3738 rows and found 6 numeric columns
## 23:32:43 Using Annoy for neighbor search, n_neighbors = 30
## 23:32:43 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:32:43 Writing NN index file to temp file /var/folders/ry/9rv81k4s20xdgj8v_km36dxm0000gn/T//Rtmptfe9L6/file18691431e9b30
## 23:32:43 Searching Annoy index using 1 thread, search_k = 3000
## 23:32:44 Annoy recall = 100%
## 23:32:44 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 23:32:45 Initializing from normalized Laplacian + noise (using RSpectra)
## 23:32:45 Commencing optimization for 500 epochs, with 147076 positive edges
## 23:32:49 Optimization finished
DimPlot(XRA, reduction = "umap",label=T)
Seurat can find markers that define clusters via differential
expression. By default, setting only ident.1, it identifes
positive and negative markers of a single cluster (specified in
ident.1), compared to all other cells.
FindAllMarkers automates this process for all clusters, but
you can also test groups of clusters vs. each other, or against all
cells.
The min.pct argument requires a feature to be detected
at a minimum percentage in either of the two groups of cells, and the
thresh.test argument requires a feature to be differentially expressed
(on average) by some amount between the two groups.
The default test used is the Wilcoxon Rank Sum test.
With the following chunks, the aim is to find markers for every clusters compared to all remaining cells. Only the positive ones are reported in a table.
XRA_markers <- FindAllMarkers(XRA,
only.pos = TRUE, # only positive markers
min.pct = 0.25, # test genes that are detected in minimum 25%
# of cells in either of the two population
logfc.threshold = 0.25) # threshold for log fold-change
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
head(XRA_markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## C3 0 4.116145 0.954 0.080 0 0 C3
## DCN 0 3.718785 0.999 0.126 0 0 DCN
## CILP 0 4.461804 0.921 0.053 0 0 CILP
## COL6A3 0 4.358773 0.957 0.102 0 0 COL6A3
## C1S 0 3.949805 0.989 0.134 0 0 C1S
## CFD 0 3.755910 0.961 0.110 0 0 CFD
Some of the columns present in the obtained data frame are:
p_val: p_val (unadjusted)
avg_log2FC: log fold-change of the average
expression between the two groups. Positive values indicate that the
feature is more highly expressed in the first group.
pct.1: The percentage of cells where the feature is
detected in the first group
pct.2: The percentage of cells where the feature is
detected in the second group
p_val_adj: Adjusted p-value, based on Bonferroni
correction using all features in the dataset.
XRA_positive_markers <- XRA_markers %>%
group_by(cluster) %>%
slice_max(n=2,order_by=avg_log2FC)
XRA_positive_markers
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.12e-137 5.10 0.25 0.011 4.80e-133 0 WNT5A
## 2 0 4.77 0.747 0.06 0 0 ABCA9
## 3 9.00e-187 9.20 0.278 0.002 3.85e-182 1 PRF1
## 4 0 8.58 0.509 0.006 0 1 CD2
## 5 0 6.15 0.687 0.012 0 2 SELP
## 6 0 6.03 0.576 0.029 0 2 VCAM1
## 7 3.13e-254 5.44 0.436 0.015 1.34e-249 3 BTNL9
## 8 6.66e- 86 4.90 0.319 0.051 2.85e- 81 3 CR381670.1
## 9 0 11.6 0.552 0.001 0 4 MYF5
## 10 4.43e-214 8.51 0.312 0.003 1.89e-209 4 DLK1
## 11 0 12.8 0.623 0.004 0 5 MS4A7
## 12 0 11.4 0.84 0.004 0 5 LYZ
## 13 5.99e- 10 3.29 0.473 0.34 2.56e- 5 6 MT1XP1
## 14 2.98e-131 3.01 0.905 0.279 1.27e-126 6 APOD
## 15 7.16e-274 11.0 0.387 0.003 3.06e-269 7 CDH6
## 16 4.10e-239 10.9 0.327 0.001 1.75e-234 7 AVPR1A
## 17 0 13.5 0.504 0.001 0 8 HSPB3
## 18 2.11e-249 12.6 0.372 0.002 9.03e-245 8 MYH7
Seurat offers several tools for visualizing marker expression.
In the violin plots, it is possible to specify which features we want to visualize. In the x-axis is reported the identity (the cluster), while on the y-axis, the expression level.
VlnPlot(XRA, features = c("APOD", "COL6A3"), pt.size = 0)
This function allows to see the expression levels of a particular gene in all the clusters. So, it is possible to use this visualization to identify marker specific for one cluster or another.
RidgePlot(XRA, features = c("APOD", "COL6A3"))
# save plot as pdf
pdf("img/15_RidgePlot.pdf")
RidgePlot(XRA, features = c("APOD", "COL6A3"))
dev.off()
## quartz_off_screen
## 2
This plot allows to see the expression of a particular feature in UMAP (so, in the dimensionality reduction space).
FeaturePlot(XRA, features = c("APOD", "COL6A3"), order =T )
Each cell is colored according to the expression of that particular feature.
Intuitive way of visualizing how feature expression changes across different identity classes (clusters). The size of the dot encodes the percentage of cells within a class, while the color encodes the Average Expression level of cells within a class (blue is high).
DotPlot(XRA,
features = c("WNT5A","ABCA9","PRF1","CD2","SELP","VCAM1","BTNL9","CR381670.1","MYF5","DLK1","MS4A7","LYZ","MT1XP1", "APOD", "CDH6", "AVPR1A","HSPB3", "MYH7") # features we want to observe
) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
In this case, the top 3 markers (or all markers if less than 3) for each cluster are plotted.
top3 <- XRA_markers %>%
group_by(cluster) %>%
top_n(n = 3, # top 3 for each cluster
wt = avg_log2FC)
DoHeatmap(XRA, features = top3$gene) + NoLegend()
Seurat utilizes R’s plotly graphing library to create interactive plots.
plot <- FeaturePlot(XRA, features = "APOD")
HoverLocator(plot = plot, information = FetchData(XRA, vars = c("ident","percent_mt", "nFeature_RNA")))
The final step of the analysis consists on trying to annotate the cell with annotated dataset. To do that, the package SingleR will be used.
SingleR can be considered a robust variant of nearest-neighbors classification, with some tweaks to improve resolution for closely related labels.
To work, it needs:
Reference set: A comprehensive transcriptomic dataset (single-cell or bulk) of pure cell types, preferably with multiple samples per cell type. In this case, will be loaded a reference set for human and one for mice.
Single-cell set: Single-cell RNA-seq dataset with unknown cell types. In this case, XRA will be the unknown dataset.
SingleR runs in two modes:
Single-cell: the annotation is performed for each single-cell independently.
Cluster: the annotation is performed on predefined clusters, where the expression of a cluster is the sum expression of all cells in the cluster. Faster and computationally efficient, but the results are not so reliable to respect to the single-cell approach.
The approach can be divided into 3 main steps:
Spearman coefficient is calculated for single-cell expression with each of the samples in the reference dataset. The correlation analysis is performed only on variable genes in the reference dataset.
Multiple correlation coefficients per cell types are aggregated to provide a single value per cell type per single-cell.
In this step SingleR reruns the correlation analysis, but only for the top cell types from step 2. The analysis is performed only on variable genes between these cell types. The lowest value cell type is removed, and then this step is repeated until only two cell types remain. The cell type corresponding to the top value after the last run is assigned to the single-cell.
In the XRA dataset, all gene symbols have letters in upper case, meaning that they are from human. So, in the following analysis is reported the annotation with human reference.
# load reference human
ref.human <- celldex::HumanPrimaryCellAtlasData()
ref.human
## class: SummarizedExperiment
## dim: 19363 713
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont
#ref.human$label.main[1:50]
To continue the analysis, it is necessary to convert the XRA dataset into a SingleCellExperiment object.
XRA_exp <- as.SingleCellExperiment(DietSeurat(XRA))
pred_human <- SingleR(test=XRA_exp,
clusters = XRA@meta.data$seurat_clusters,
# vector of cluster identities for each cell
ref=ref.human,
labels=ref.human$label.main,
fine.tune = TRUE, prune = TRUE)
pred_human_df <- data.frame(cluster=rownames(pred_human),
labels=pred_human$labels)
pred_human_df
## cluster labels
## 1 0 Tissue_stem_cells
## 2 1 T_cells
## 3 2 Endothelial_cells
## 4 3 Endothelial_cells
## 5 4 Tissue_stem_cells
## 6 5 Monocyte
## 7 6 Tissue_stem_cells
## 8 7 Tissue_stem_cells
## 9 8 Tissue_stem_cells
Once assigned a label, plotScoreHeatmap can be used to
visualize the results. plotScoreHeatmap() displays the
correlation-based scores for all clusters across all reference labels.
Each cluster is a column while each row is a label in the reference
dataset.
heatmap1 <- plotScoreHeatmap(pred_human)
Then, we add the annotation obtained to the original Seurat object to later plot the clusters with the labels.
XRA[["SingleR_human_labels"]] <- pred_human_df$labels[match(XRA@meta.data$seurat_clusters, pred_human_df$cluster)]
DimPlot(XRA, group.by = "SingleR_human_labels")
ScType accepts both positive and negative markers, i.e., gene that are not expected to be expressed in a particular cell type. Sctype provides its own marker database for human and mouse, obtained from the integration of the information available in the CellMarker database and PanglaoDB.
scType cell_type annotation:
For each positive/negative marker compute specificity score, which indicate whether a gene is a marker for a specific cell types.
The raw expression matrix is normalized and Z-transform (scale the expression of each gene across cells)
The transformed matrix is multiply by the cell-type specificity score
For each cell types the expression scores of all its positive markers are summarized into a single enrichment score by summing them and dividing by square root of their number. The same is done for the negative markers.
The negative marker expression score is subtracted from the positive score to obtain the final enrichment score. Individual cells are assigned to a cell type based on the maximum value for the cell type marker set.
By default, scType use the in-built marker DB.
# DB file
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
tissue <- c("Brain","Immune system","Pancreas","Liver","Eye","Kidney","Brain","Lung","Adrenal","Heart","Intestine","Muscle","Placenta","Spleen","Stomach","Thymus")
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)
The function sctype_score() performs the scoring
process. It takes in input the scaling data (already preprocessed), even
if we can give in input the raw counts by specifying
scaled = FALSE.
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
es.max <- sctype_score(scRNAseqData = XRA[["RNA"]]$scale.data,
scaled = TRUE,
# specify the list of markers to use with gs and gs2
gs = gs_list$gs_positive, # positive markers
gs2 = gs_list$gs_negative) # negative markers
# in case there are no negative markers just set gs2 = NULL
# merge by cluster
cL_resutls <- do.call("rbind",
lapply(unique(XRA@meta.data$seurat_clusters),
function(cl){
es.max.cl = sort(rowSums(es.max[ ,rownames(XRA@meta.data[XRA@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
head(data.frame(cluster = cl,
type = names(es.max.cl),
scores = es.max.cl,
ncells = sum(XRA@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores <- cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)
The annotation obtained is then added to the meta data by creating
another column called scType_labels. Then, the results are
visualized.
# add annotation to meta data
XRA[["scType_labels"]] <- sctype_scores$type[match(XRA@meta.data$seurat_clusters, sctype_scores$cluster)]
# visualize the results
DimPlot(XRA, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'scType_labels')
ScType provides a functionality for automated guessing
of a tissue type given in input a unknown dataset.
# load auto-detection function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/auto_detect_tissue_type.R")
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
# guess a tissue type
tissue_guess <- auto_detect_tissue_type(path_to_db_file = db_,
seuratObject = XRA,
scaled = TRUE,
assay = "RNA")
# if scaled = TRUE, make sure the data is scaled, as seuratObject[[assay]]@scale.data is used. If you just created a Seurat object, without any scaling and normalization, set scaled = FALSE, seuratObject[[assay]]@counts will be used
However, this chunck generates the following error and it wasn’t possible to infer the tissue type of the dataset.
Error in Z[cell_markers_genes_score[jj, "gene_"], ] : subscript out of bounds
As it can be observed from the plots reported below, the ScType annotation results to be more precise and able to assign a label to each cluster.
DimPlot(XRA, group.by = "SingleR_human_labels")
DimPlot(XRA, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'scType_labels')